Grouping and summarizing
Summarizing the median life expectancy
You’ve seen how to find the mean life expectancy and the total population across a set of observations, but mean() and sum() are only two of the functions R provides for summarizing a collection of numbers. Here, you’ll learn to use the median() function in combination with summarize().
By the way, dplyr displays some messages when it’s loaded that we’ve been hiding so far. They’ll show up in red and start with:
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
This message appears each time you load dplyr: it lists built-in functions that dplyr masks with its own functions of the same name. You won’t need to worry about it within this course.
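To make the masking message concrete, here is a minimal sketch. The `toy` data frame and its values are invented for illustration; only `dplyr::filter()` and `stats::filter()` are real functions.

```r
library(dplyr)

# Toy data frame; the values are invented for illustration only.
toy <- data.frame(year = c(1952, 1957, 1957), lifeExp = c(40, 50, 60))

# After library(dplyr), a bare filter() resolves to dplyr's version,
# which keeps the rows matching a condition.
toy_1957 <- filter(toy, year == 1957)

# The masked function from the stats package is still reachable
# through its namespace; here it computes a centered moving average.
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
```

Masking only changes which function a bare name refers to; prefixing the package name with `::` always selects a specific one.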
# Load the knitr and kableExtra packages
library(knitr)
library(kableExtra)
options(knitr.table.format = "html")
# Load the gapminder package
library(gapminder)
# Load the dplyr package
library(dplyr)
# Load the ggplot2 package as well
library(ggplot2)
theme_set(theme_bw()) # pre-set the bw theme
# Summarize to find the median life expectancy
gapminder %>%
summarize(medianLifeExp = median(lifeExp)) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center", font_size = 11) %>%
row_spec(0, bold = TRUE, color = "white", background = "#3f7689")

| medianLifeExp |
|---|
| 60.7125 |
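As a small illustration of why you might reach for median() rather than mean(), the sketch below summarizes a toy column containing one extreme value. The data frame `toy` and its values are invented for this example.

```r
library(dplyr)

# Toy values (invented): four similar numbers and one extreme outlier.
toy <- data.frame(value = c(1, 2, 3, 4, 100))

# Both summaries are computed in a single summarize() call.
toy_summary <- toy %>%
  summarize(meanValue = mean(value), medianValue = median(value))

toy_summary
# meanValue is 22, medianValue is 3
```

The outlier pulls the mean far above most of the values, while the median stays at the middle observation.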
Summarizing the median life expectancy in 1957
Rather than summarizing the entire dataset, you may want to find the median life expectancy for only one particular year. In this case, you’ll find the median in the year 1957.
# Filter for 1957 then summarize the median life expectancy
gapminder %>%
filter(year == 1957) %>%
summarize(medianLifeExp = median(lifeExp)) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center", font_size = 11) %>%
row_spec(0, bold = TRUE, color = "white", background = "#3f7689")

| medianLifeExp |
|---|
| 48.3605 |
Summarizing multiple variables in 1957
The summarize() verb allows you to summarize multiple variables at once. In this case, you’ll use the median() function to find the median life expectancy and the max() function to find the maximum GDP per capita.
# Filter for 1957 then summarize the median life expectancy and the maximum GDP per capita
gapminder %>%
filter(year == 1957) %>%
summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap)) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center", font_size = 11) %>%
row_spec(0, bold = TRUE, color = "white", background = "#3f7689")

| medianLifeExp | maxGdpPercap |
|---|---|
| 48.3605 | 113523.1 |
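The same pattern can be checked by hand on a tiny dataset. The sketch below uses an invented `toy` data frame (three hypothetical countries, two years) and adds a third summary, `n()`, which is not used in the exercise above, to show that each named argument becomes its own column.

```r
library(dplyr)

# Toy data (invented values): three hypothetical countries in two years.
toy <- data.frame(
  country   = c("A", "B", "C", "A", "B", "C"),
  year      = c(1952, 1952, 1952, 1957, 1957, 1957),
  lifeExp   = c(40, 50, 60, 45, 55, 70),
  gdpPercap = c(1000, 2000, 3000, 1500, 2500, 9000)
)

# Keep only 1957, then compute three summaries at once.
toy_1957 <- toy %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap),
            nCountries = n())

toy_1957
# one row: medianLifeExp = 55, maxGdpPercap = 9000, nCountries = 3
```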
Summarizing by year
Now, you’ll perform those two summaries within each year in the dataset, using the group_by() verb.
# Find median life expectancy and maximum GDP per capita in each year
gapminder %>%
group_by(year) %>%
summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap)) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center", font_size = 11) %>%
row_spec(0, bold = TRUE, color = "white", background = "#3f7689")

| year | medianLifeExp | maxGdpPercap |
|---|---|---|
| 1952 | 45.1355 | 108382.35 |
| 1957 | 48.3605 | 113523.13 |
| 1962 | 50.8810 | 95458.11 |
| 1967 | 53.8250 | 80894.88 |
| 1972 | 56.5300 | 109347.87 |
| 1977 | 59.6720 | 59265.48 |
| 1982 | 62.4415 | 33693.18 |
| 1987 | 65.8340 | 31540.97 |
| 1992 | 67.7030 | 34932.92 |
| 1997 | 69.3940 | 41283.16 |
| 2002 | 70.8255 | 44683.98 |
| 2007 | 71.9355 | 49357.19 |
Interesting: notice that median life expectancy across countries is generally going up over time, but maximum GDP per capita is not.
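One way to read group_by() followed by summarize() is "run the same summary separately on each group". The sketch below, on an invented `toy` data frame, checks that the 1957 row of a grouped summary matches what filter() plus summarize() gives for that year alone.

```r
library(dplyr)

# Toy data (invented values): three observations in each of two years.
toy <- data.frame(
  year      = c(1952, 1952, 1952, 1957, 1957, 1957),
  lifeExp   = c(40, 50, 60, 45, 55, 70),
  gdpPercap = c(1000, 2000, 3000, 1500, 2500, 9000)
)

# Grouped summary: one output row per year.
by_year_toy <- toy %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap))

# The same summary computed for 1957 alone.
only_1957 <- toy %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap))

# The 1957 row of the grouped result carries the same two values.
row_1957 <- by_year_toy %>% filter(year == 1957)
```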
Summarizing by continent
You can group by any variable in your dataset to create a summary. Rather than comparing across time, you might be interested in comparing among continents. You’ll want to do that within one year of the dataset: let’s use 1957.
# Find median life expectancy and maximum GDP per capita in each continent in 1957
gapminder %>%
filter(year == 1957) %>%
group_by(continent) %>%
summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap)) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center", font_size = 11) %>%
row_spec(0, bold = TRUE, color = "white", background = "#3f7689")

| continent | medianLifeExp | maxGdpPercap |
|---|---|---|
| Africa | 40.5925 | 5487.104 |
| Americas | 56.0740 | 14847.127 |
| Asia | 48.2840 | 113523.133 |
| Europe | 67.6500 | 17909.490 |
| Oceania | 70.2950 | 12247.395 |
Summarizing by continent and year
Instead of grouping just by year, or just by continent, you’ll now group by both continent and year to summarize within each.
# Find median life expectancy and maximum GDP per capita in each continent/year combination
gapminder %>%
group_by(continent, year) %>%
summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap)) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center", font_size = 11) %>%
row_spec(0, bold = TRUE, color = "white", background = "#3f7689") %>%
scroll_box(width = "100%", height = "300px")

| continent | year | medianLifeExp | maxGdpPercap |
|---|---|---|---|
| Africa | 1952 | 38.8330 | 4725.296 |
| Africa | 1957 | 40.5925 | 5487.104 |
| Africa | 1962 | 42.6305 | 6757.031 |
| Africa | 1967 | 44.6985 | 18772.752 |
| Africa | 1972 | 47.0315 | 21011.497 |
| Africa | 1977 | 49.2725 | 21951.212 |
| Africa | 1982 | 50.7560 | 17364.275 |
| Africa | 1987 | 51.6395 | 11864.408 |
| Africa | 1992 | 52.4290 | 13522.158 |
| Africa | 1997 | 52.7590 | 14722.842 |
| Africa | 2002 | 51.2355 | 12521.714 |
| Africa | 2007 | 52.9265 | 13206.485 |
| Americas | 1952 | 54.7450 | 13990.482 |
| Americas | 1957 | 56.0740 | 14847.127 |
| Americas | 1962 | 58.2990 | 16173.146 |
| Americas | 1967 | 60.5230 | 19530.366 |
| Americas | 1972 | 63.4410 | 21806.036 |
| Americas | 1977 | 66.3530 | 24072.632 |
| Americas | 1982 | 67.4050 | 25009.559 |
| Americas | 1987 | 69.4980 | 29884.350 |
| Americas | 1992 | 69.8620 | 32003.932 |
| Americas | 1997 | 72.1460 | 35767.433 |
| Americas | 2002 | 72.0470 | 39097.100 |
| Americas | 2007 | 72.8990 | 42951.653 |
| Asia | 1952 | 44.8690 | 108382.353 |
| Asia | 1957 | 48.2840 | 113523.133 |
| Asia | 1962 | 49.3250 | 95458.112 |
| Asia | 1967 | 53.6550 | 80894.883 |
| Asia | 1972 | 56.9500 | 109347.867 |
| Asia | 1977 | 60.7650 | 59265.477 |
| Asia | 1982 | 63.7390 | 33693.175 |
| Asia | 1987 | 66.2950 | 28118.430 |
| Asia | 1992 | 68.6900 | 34932.920 |
| Asia | 1997 | 70.2650 | 40300.620 |
| Asia | 2002 | 71.0280 | 36023.105 |
| Asia | 2007 | 72.3960 | 47306.990 |
| Europe | 1952 | 65.9000 | 14734.233 |
| Europe | 1957 | 67.6500 | 17909.490 |
| Europe | 1962 | 69.5250 | 20431.093 |
| Europe | 1967 | 70.6100 | 22966.144 |
| Europe | 1972 | 70.8850 | 27195.113 |
| Europe | 1977 | 72.3350 | 26982.291 |
| Europe | 1982 | 73.4900 | 28397.715 |
| Europe | 1987 | 74.8150 | 31540.975 |
| Europe | 1992 | 75.4510 | 33965.661 |
| Europe | 1997 | 76.1160 | 41283.164 |
| Europe | 2002 | 77.5365 | 44683.975 |
| Europe | 2007 | 78.6085 | 49357.190 |
| Oceania | 1952 | 69.2550 | 10556.576 |
| Oceania | 1957 | 70.2950 | 12247.395 |
| Oceania | 1962 | 71.0850 | 13175.678 |
| Oceania | 1967 | 71.3100 | 14526.125 |
| Oceania | 1972 | 71.9100 | 16788.629 |
| Oceania | 1977 | 72.8550 | 18334.198 |
| Oceania | 1982 | 74.2900 | 19477.009 |
| Oceania | 1987 | 75.3200 | 21888.889 |
| Oceania | 1992 | 76.9450 | 23424.767 |
| Oceania | 1997 | 78.1900 | 26997.937 |
| Oceania | 2002 | 79.7400 | 30687.755 |
| Oceania | 2007 | 80.7195 | 34435.367 |
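When you group by more than one variable, the result of summarize() keeps part of the grouping by default. The sketch below, on an invented `toy` data frame, contrasts the default with the `.groups = "drop"` argument, assuming dplyr 1.0.0 or later (where `.groups` was introduced).

```r
library(dplyr)

# Toy data (invented values): two hypothetical continents, two years.
toy <- data.frame(
  continent = c("Asia", "Asia", "Asia", "Asia", "Europe", "Europe"),
  year      = c(1952, 1952, 1957, 1957, 1952, 1957),
  lifeExp   = c(40, 50, 45, 55, 65, 68)
)

# Default: summarize() drops only the last grouping level (year),
# so the result is still grouped by continent.
still_grouped <- toy %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp))

# .groups = "drop" returns an ungrouped result instead.
ungrouped <- toy %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp), .groups = "drop")

group_vars(still_grouped)  # "continent"
group_vars(ungrouped)      # character(0)
```

Both versions contain one row per continent and year combination; the difference only matters for verbs applied afterwards, which would otherwise keep operating within each continent.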
Visualizing summarized data
Visualizing median life expectancy over time
In the last chapter, you summarized the gapminder data to calculate the median life expectancy within each year. Here, that summary is saved as the by_year dataset.
Now you can use the ggplot2 package to turn this into a visualization of changing life expectancy over time.
by_year <- gapminder %>%
group_by(year) %>%
summarize(medianLifeExp = median(lifeExp),
maxGdpPercap = max(gdpPercap))
# Create a scatter plot showing the change in medianLifeExp over time
ggplot(by_year, aes(x = year, y = medianLifeExp)) +
geom_point() +
expand_limits(y = 0) +
labs(subtitle="Life expectancy over time",
y="Life expectancy",
x="Year",
title="Scatterplot",
caption = "")
It looks like median life expectancy across countries is increasing over time.
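Because a per-year summary has exactly one row per year, a line layer is a natural companion to the points. The sketch below is a variation, not the exercise's own plot: it adds geom_line() and uses an invented `toy_by_year` data frame in place of by_year.

```r
library(ggplot2)

# Toy summary (invented values) with the same columns as by_year.
toy_by_year <- data.frame(
  year          = c(1952, 1957, 1962),
  medianLifeExp = c(45, 48, 51)
)

# Points show each yearly summary; the line connects them in x order.
p <- ggplot(toy_by_year, aes(x = year, y = medianLifeExp)) +
  geom_line() +
  geom_point() +
  expand_limits(y = 0)  # include 0 on the y-axis, as in the plot above

p
```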
Visualizing median GDP per capita per continent over time
In the last exercise you were able to see how the median life expectancy of countries changed over time. Now you’ll examine the median GDP per capita instead, and see how the trend differs among continents.
# Summarize medianGdpPercap within each continent within each year:
by_year_continent <- gapminder %>%
group_by(continent, year) %>%
summarize(medianGdpPercap = median(gdpPercap))
# Plot the change in medianGdpPercap in each continent over time
ggplot(by_year_continent, aes(x = year, y = medianGdpPercap, color = continent)) +
geom_point() +
expand_limits(y = 0) +
labs(subtitle="Median GDP per capita over time by continent",
y="GDP per capita",
x="Year",
title="Scatterplot",
caption = "")
Comparing median life expectancy and median GDP per continent in 2007
In these exercises you’ve generally created plots that show change over time. But as another way of exploring your data visually, you can also use ggplot2 to plot summarized data to compare continents within a single year.
# Summarize the median GDP and median life expectancy per continent in 2007
by_continent_2007 <- gapminder %>%
filter(year == 2007) %>%
group_by(continent) %>%
summarize(medianLifeExp = median(lifeExp), medianGdpPercap = median(gdpPercap))
# Use a scatter plot to compare the median GDP and median life expectancy
ggplot(by_continent_2007, aes(x = medianGdpPercap, y = medianLifeExp, color = continent)) +
geom_point() +
expand_limits(y = 0) +
labs(subtitle="Median life expectancy with median GDP per continent in 2007",
y="Median life expectancy",
x="Median GDP per capita",
title="Scatterplot",
caption = "")
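With only a handful of points, labelling each one directly can be easier to read than matching colors against a legend. The following is a sketch of that variation using geom_text(); the `toy_by_continent` data frame, its group names, and its values are invented stand-ins for by_continent_2007.

```r
library(ggplot2)

# Toy summary (invented values) with the same columns as by_continent_2007.
toy_by_continent <- data.frame(
  continent       = c("X", "Y", "Z"),
  medianGdpPercap = c(1500, 9000, 30000),
  medianLifeExp   = c(52, 70, 79)
)

p <- ggplot(toy_by_continent,
            aes(x = medianGdpPercap, y = medianLifeExp, color = continent)) +
  geom_point() +
  # Write each group's name just above its point; keep the text layer
  # out of the legend so the legend shows only the point glyphs.
  geom_text(aes(label = continent), vjust = -1, show.legend = FALSE) +
  expand_limits(y = 0)

p
```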